On Modeling Protein Superfamilies with Low Primary Sequence Conservation
Abstract
Motivation: Development of tools for identification of new thioredoxin-fold proteins as well as other proteins belonging to superfamilies with low primary sequence conservation.
Results: We present several algorithms for identifying thioredoxin (Trx)-fold proteins containing a conserved CxxC motif (two cysteines separated by two residues). The low conservation of primary sequence in this protein superfamily makes conventional methods difficult to use. Therefore, we use structural properties to build our classifiers. These structural properties include secondary structure patterns as well as various properties of the residues in the protein sequences. We use this information to model Trx-fold proteins via hidden Markov models, decision trees, and algorithms in the multiple-instance learning model. In 9-fold and 12-fold jack-knife tests, some of our models performed quite well, with high true positive and true negative rates. In addition, by combining a small number of our classifiers, we can identify 100% of the Trx-fold proteins in these jack-knife tests with moderate false positive rates. We also identified several candidate Trx-fold proteins in the C. jejuni, M. jannaschii, E. coli and S. cerevisiae genomes. Since our techniques are very general, they should be applicable to other superfamilies with low primary sequence conservation.
Availability: C code available via email from contact author.
Contact: Stephen Scott, Dept. of Computer Science, 115 Ferguson Hall, University of Nebraska, Lincoln, NE 68588-0115, USA, [email protected], (402) 472-6994, fax: (402) 472-7767
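The CxxC motif mentioned in the abstract (two cysteines separated by any two residues) can be located with a simple pattern scan. The sketch below is only an illustration of that definition, not the authors' method: the function name `find_cxxc` and the toy sequence are assumptions introduced here, and a motif hit alone would be at most a pre-filter, since the classifiers in the paper rely on structural properties.

```python
import re

# "C", any two residues, "C". The lookahead lets overlapping
# occurrences (e.g. in "CCCCC") each be reported.
CXXC = re.compile(r"(?=(C[A-Z]{2}C))")


def find_cxxc(sequence):
    """Return (0-based position, motif) pairs for every CxxC occurrence.

    `sequence` is assumed to be a one-letter amino acid string.
    """
    seq = sequence.upper()
    return [(m.start(), m.group(1)) for m in CXXC.finditer(seq)]


if __name__ == "__main__":
    # Hypothetical toy fragment, not a real protein record.
    print(find_cxxc("MKWCGPCKAAC"))
```

Sequences that contain the motif could then be passed on to a structure-aware classifier of the kind the abstract describes.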
Related resources
A Study in Modeling Low-Conservation Protein Superfamilies
We present several algorithms for identification of new proteins in superfamilies with low primary sequence conservation. The low conservation of primary sequence in protein superfamilies such as Thioredoxin-fold (Trx-fold) makes conventional methods such as hidden Markov models (HMMs) difficult to use. Therefore, we use structural properties to build our classifiers. These structural propertie...
A Fast Algorithm for Gmil and Its Application to Protein Super-family Identification
We develop an algorithm for a generalization of the multiple-instance learning model in which a bag's label is not based on a single instance's proximity to a single target point. Rather, a bag is positive if and only if it contains a collection of instances, each near one of a set of target points. This algorithm is significantly faster in practice than others in this model. We applied our alg...
SMoS: a database of structural motifs of protein superfamilies.
The Structural Motifs of Superfamilies (SMoS) database provides information about the structural motifs of aligned protein domain superfamilies. Such motifs among structurally aligned multiple members of protein superfamilies are recognized by the conservation of amino acid preference and solvent inaccessibility and are examined for the conservation of other features like secondary structural c...
Rebelling for a Reason: Protein Structural “Outliers”
Analysis of structural variation in domain superfamilies can reveal constraints in protein evolution which aids protein structure prediction and classification. Structure-based sequence alignment of distantly related proteins, organized in PASS2 database, provides clues about structurally conserved regions among different functional families. Some superfamily members show large structural diffe...
In Silico Analysis of Primary Sequence and Tertiary Structure of Lepidium Draba Peroxidase
Peroxidase enzymes are widely applicable in industry and diagnosis. Recently, we introduced a new kind of peroxidase gene from Lepidium draba (LDP). According to protein multiple sequence alignment results, LDP had 93% similarity and 88.96% identity with horseradish peroxidase C1A (HRP C1A). In the current study we employed in silico tools to determine, to which group of peroxidase enzymes LDP...
Publication date: 2003